Abstract

Minimal cost feature selection aims to obtain a trade-off between test
costs and misclassification costs. This issue has been addressed recently on
nominal data. In this paper, we consider numerical data with measurement errors
and study minimal cost feature selection in this model. First, we build a data
model with normal distribution measurement errors. Second, the neighborhood
of each data item is constructed through the confidence interval. Compared
with discretized intervals, neighborhoods are a more reasonable way to retain
the information in the data. Third, we define a new minimal total cost feature
selection problem by considering the trade-off between test costs and
misclassification costs. Fourth, we propose a backtracking algorithm with three
effective pruning techniques to deal with this problem. The algorithm is tested
on four UCI data sets. Experimental results indicate that the pruning techniques
are effective, and the algorithm is efficient for data sets with nearly one
thousand objects.

Keywords: feature selection; normal distribution measurement errors; test 
costs; misclassification costs. 



1. Introduction 

Minimal cost feature selection aims to obtain a trade-off between test
costs and misclassification costs. Test costs are what we pay for collecting data
items pp. Test costs are often measured by time, money, and other resources.
When only test costs are considered (see, e.g., [21 El HI Hj), the minimal cost
feature selection problem degrades to the minimal test cost reduct problem [1].
Therefore, the minimal cost feature selection problem is a generalization of the
minimal test cost reduct problem.

In addition to the test costs, misclassification costs must be considered
in cost-sensitive learning [BJ. The misclassification cost is the penalty we
receive when deciding that an object belongs to class J when its real class is K
[S]. For example, in medical diagnosis, if a cancer (non-cancer) patient is
classified as negative (positive), a penalty is incurred. When only
misclassification costs are considered (see, e.g., [HI IS]), the average
misclassification cost is a more general metric than the accuracy [10].

It is important to consider both test costs and misclassification costs in many
applications [5]. Test costs are paid for every object, whereas misclassification
costs are paid only for misclassified objects. Therefore we should take into
account the total cost on a set of objects by considering the trade-off between
test costs and misclassification costs. This issue has been addressed recently
on nominal data. In real applications, however, data are often numeric and
always carry some measurement errors. Such measurement errors are universal
and inescapable.

In this paper, we study minimal cost feature selection considering numerical
data with measurement errors. First, a data model with measurement errors
under the normal distribution is defined. Second, we construct the neighborhood
of each data item through the confidence interval. Compared with discretized
intervals, neighborhoods are a more reasonable way to retain the information in
the data. Third, considering the trade-off between test costs and
misclassification costs, we define a new minimal total cost feature selection
problem. Fourth, a backtracking algorithm with three effective pruning
techniques is proposed to deal with this problem.

Four open data sets from the UCI (University of California - Irvine) library
are employed to study the efficiency and effectiveness of our algorithm.
Experiments are undertaken with the open source software Coser [11] to validate
the performance of this algorithm. Experimental results indicate that the
pruning techniques are effective, and the algorithm is efficient for data sets
with nearly one thousand objects.

This paper is organized as follows. Section 2 introduces a decision system
with normal distribution measurement errors. Then we introduce test costs
and misclassification costs to this decision system, and define a cost-sensitive
decision system. In Section 3 we define a new minimal cost feature selection
problem by considering test costs and misclassification costs. In order to deal
with this problem, a full description of a backtracking algorithm is given in
Section 4. Experimental settings and results are discussed in Section 5. Finally,
Section 6 concludes and suggests further research trends.

2. Data models 

In this section, the concept of decision systems with normal distribution
measurement errors (NDME) is revisited. Then the neighborhood of each data
item is constructed through the confidence interval. Finally, decision systems
based on NDME with test costs and misclassification costs are presented.

Definition 1. [12] A decision system with normal distribution measurement
errors (NEDS) S is the 6-tuple:

\[ S = (U, C, d, V = \{V_a \mid a \in C \cup \{d\}\}, I = \{I_a \mid a \in C \cup \{d\}\}, n), \tag{1} \]

where \(U\) is the nonempty set called a universe, \(C\) is the nonempty set of
conditional attributes, \(d\) is the decision attribute, \(V_a\) is the set of
values for each \(a \in C \cup \{d\}\), and \(I_a : U \to V_a\) is an information
function for each \(a \in C \cup \{d\}\). \(n : C \to \mathbb{R}^+ \cup \{0\}\)
gives the maximal measurement error of each \(a \in C\), and \(\pm n(a)\) are the
confidence limits of \(a\).

Confidence limits are the lower and upper boundaries of a confidence interval.
In real applications, there are a number of measurement methods to obtain
a numerical data item with different measurement errors. The measurement
errors often follow a normal distribution, which is found to be applicable over
almost the whole of science and engineering measurement. We introduce the
confidence interval of the normal distribution to our model. With Definition 1,
a new neighborhood is defined as follows.

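Definition 2. \Wj Let \(S = (U, C, d, V, I, n)\) be a NEDS. Given \(x_i \in U\) and
\(B \subseteq C\), the neighborhood of \(x_i\) with respect to normal distribution
measurement errors on test set \(B\) is defined as

\[ n_B(x_i) = \{x \in U \mid \forall a \in B, |a(x) - a(x_i)| \le 2n(a)\}, \tag{2} \]

where the error value of \(a\) lies in \([-n(a), +n(a)]\). Two measurements, each
with an error in this interval, can therefore differ by up to \(2n(a)\) while
representing the same true value.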
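One common way to obtain a bound such as n(a) is to take the half-width of a two-sided confidence interval for a zero-mean normal error. The following minimal sketch assumes exactly that. The function name `confidence_limit`, the standard deviation `sigma` and the 95% level are illustrative assumptions and are not taken from the model above.

```python
from statistics import NormalDist


def confidence_limit(sigma: float, confidence: float = 0.95) -> float:
    """Half-width of a two-sided confidence interval for a zero-mean
    normal error with standard deviation sigma (hypothetical helper)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # Two-sided interval: leave (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * sigma


# Assumed example: an instrument whose error has sigma = 0.005.
n_a = confidence_limit(sigma=0.005, confidence=0.95)
print(f"n(a) = {n_a:.4f}, neighborhood threshold 2n(a) = {2 * n_a:.4f}")
```

A higher confidence level widens the interval, which in turn enlarges every neighborhood built from it.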

From Definition 2 we know that

\[ n_B(x_i) = \bigcap_{a \in B} n_{\{a\}}(x_i). \tag{3} \]

That is, the neighborhood \(n_B(x_i)\) is the intersection of a number of basic
neighborhoods (see, e.g., [H El [HI US]). For any \(x \in U\) and any \(a \in B\),
\(x \in n_B(x)\). Therefore, for any \(B \subseteq C\), \(\bigcup_{x \in U} n_B(x) = U\).
Hence the set \(\{n_B(x_i) \mid x_i \in U\}\) is a covering (see, e.g., [I3Q3QJ])
of \(U\).
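The neighborhood of Definition 2, its decomposition (3) and the covering property can be checked on a small example. The sketch below uses a made-up data set and made-up error bounds (they are not the values of any table in this paper), represents attributes by column index, and assumes the non-strict comparison so that every object belongs to its own neighborhood even when n(a) = 0.

```python
from typing import Sequence, Set


def neighborhood(data: Sequence[Sequence[float]],
                 errors: Sequence[float],
                 features: Sequence[int],
                 i: int) -> Set[int]:
    """Indices of the objects within 2 * n(a) of object i on every
    attribute a in `features` (the set n_B(x_i))."""
    return {
        j for j, row in enumerate(data)
        if all(abs(row[a] - data[i][a]) <= 2 * errors[a] for a in features)
    }


# Hypothetical data: four objects, two attributes, n(a) = 0.01 for both.
DATA = [[0.30, 0.20], [0.31, 0.40], [0.50, 0.41], [0.515, 0.90]]
ERRORS = [0.01, 0.01]

for i in range(len(DATA)):
    both = neighborhood(DATA, ERRORS, [0, 1], i)
    # Equation (3): intersection of the single-attribute neighborhoods.
    split = neighborhood(DATA, ERRORS, [0], i) & neighborhood(DATA, ERRORS, [1], i)
    print(i, sorted(both), both == split)

# Covering: the union of all neighborhoods is the whole universe.
union = set().union(*(neighborhood(DATA, ERRORS, [0, 1], i)
                      for i in range(len(DATA))))
print(union == set(range(len(DATA))))
```

Adding attributes to B can only shrink a neighborhood, since each extra attribute adds one more condition to satisfy.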

In cost-sensitive learning, test costs and misclassification costs are the two
most important types of costs. Now, we define a cost-sensitive decision system
by considering both test and misclassification costs.

Definition 3. A decision system based on NDME with test costs and
misclassification costs (NEDS-TM) S is the 8-tuple:

\[ S = (U, C, d, V, I, n, tc, mc), \tag{4} \]

where \(U\), \(C\), \(d\), \(V\), \(I\) and \(n\) have the same meanings as in
Definition 1, \(tc : C \to \mathbb{R}^+ \cup \{0\}\) is the test costs function and
\(mc : k \times k \to \mathbb{R}^+ \cup \{0\}\) is the misclassification costs
function, where k =

For any \(B \subseteq C\), the sequence-independent test costs function \(tc\) is
defined as follows:

\[ tc(B) = \sum_{a \in B} tc(a). \tag{5} \]

The misclassification costs function can be represented by a matrix
\(mc = \{mc_{k \times k}\}\). The misclassification cost [20], [21], [22] is the
penalty we receive when deciding that an object belongs to class m when its real
class is n [1], [23]. If the classification is correct, the misclassification
cost is \(mc[m, m] = 0\). The following example gives an intuitive understanding.



Table 1: An example numerical value attribute decision table.

x     a1     a2     a3     d
x1    0.31   0.23   0.08   y
x2    0.14   0.38   0.23   y
x3    0.25   0.40   0.40   y
x4    0.60   0.46   0.51   n
x5    0.41   0.64   0.62   n
x6    0.35   0.50   0.75   n



Table 2: A neighborhood boundaries vector and a test costs vector.

a       a1       a2       a3
n(a)    0.0069   0.0087   0.0086
c(a)    $28      $19      $56



Example 4. A NEDS-TM is given by Tables 1 and 2 and

\[ mc = \begin{pmatrix} 0 & 800 \\ 200 & 0 \end{pmatrix}. \tag{6} \]

That is, the test costs are $28, $19 and $56, respectively. In this data set, the
decision attribute is used to split the data set into two sets. If a person
belongs to set "y" and is misclassified as set "n", a penalty of $800 is paid;
conversely, $200 is paid.
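The costs of Example 4 can be written down directly. The sketch below is only an illustration: the dictionary layout, the key order (true class, assigned class) and the function name `tc` for the additive test cost of a feature subset are choices made here, while the dollar amounts come from the example.

```python
from typing import Dict, Iterable, Tuple

# Test costs of the three attributes, in dollars.
TEST_COST: Dict[str, int] = {"a1": 28, "a2": 19, "a3": 56}

# Misclassification costs keyed by (true class, assigned class).
MIS_COST: Dict[Tuple[str, str], int] = {
    ("y", "y"): 0, ("y", "n"): 800,
    ("n", "y"): 200, ("n", "n"): 0,
}


def tc(features: Iterable[str]) -> int:
    """Sequence-independent test cost of a feature subset: the sum of
    the individual test costs."""
    return sum(TEST_COST[a] for a in features)


print(tc(["a1", "a2"]))        # cost of testing a1 and a2
print(tc(TEST_COST))           # cost of testing every attribute
print(MIS_COST[("y", "n")])    # a "y" object assigned to "n"
```

Because the test cost is additive, the order in which the tests are taken does not matter.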



3. Minimal cost feature selection problem 

In this work, we focus on selecting a subset of features to minimize the total
cost, that is, cost-sensitive feature selection based on test costs and
misclassification costs. The problem of finding such a subset of features is
called the minimal cost feature selection problem.

Problem 5. The minimal cost feature selection problem (MCFS).
Input: \(S = (U, C, d, V, I, n, tc, mc)\);
Output: \(R \subseteq C\) and the average total cost (ATC);
Optimization objective: \(\min ATC(R)\).

Compared with the classical minimal reduction problem, there are two
differences. The first is the input: the external information consists of the
test costs and misclassification costs as well as the normal distribution
measurement errors. The second is the optimization objective, which is to
minimize the total cost instead of the number of features. In data mining
applications, the average total cost (ATC) is considered to be a more general
metric than the accuracy [9]. The average total cost is computed as follows.

Step 1. Compute the neighborhood of each data item. Let \(B\) be a
selected feature set, and let \(N(B)\) be the set \(\{n_B(x_i) \mid x_i \in U\}\).

Step 2. Assign one class. Let \(U' = n_B(x_i)\) and let \(c_{B(x_i)}(d_m)\) be the
number of objects of class \(d_m\) in \(U'\), where \(d_m \in \{I_d\}\). We adjust
the different classes of the elements in \(U'\) to one class so as to minimize
the misclassification cost of \(U'\). There are two cases.

   1. \(U' \subseteq POS_B(\{d\})\) if and only if \(d(x) = d(y)\) for any
      \(x, y \in U'\). In this case, the misclassification cost of \(U'\) is
      \(mc(U', B) = 0\). For any \(x \in U'\), the assigned class is
      \(d'(x) = d(x)\).

   2. However, if there exist \(x, y \in U'\) such that \(d(x) \ne d(y)\), we
      adjust the different classes of the elements in \(U'\) to one class. Let
      \(d(x)\) be the m-class and \(d(y)\) be the n-class. We select the case
      that minimizes the misclassification cost of \(U'\):

      \[ mc(U', B) = \min(mc_{m,n} \times |U'_m|, mc_{n,m} \times |U'_n|). \tag{7} \]

      For any \(x \in U'\),

      \[ d'(x) = \begin{cases} n\text{-class} & \text{if } mc(U', B) = mc_{m,n} \times |U'_m|, \\ m\text{-class} & \text{if } mc(U', B) = mc_{n,m} \times |U'_n|, \end{cases} \tag{8} \]

      where \(mc_{m,n}\) is the cost of classifying an object of the m-class into
      the n-class, and \(|U'_m|\) is the number of m-class objects in \(U'\).

Step 3. Compute the average misclassification cost. The class of \(x_i\) is
\(d'(x_i) = d_m\) if and only if \(d_m\) attains
\(\max\{c_{B(x_i)}(d_m) \mid d_m \in \{I_d\}\}\). The misclassification costs are

\[ mc^*(d(x_i), d'(x_i)) = \begin{cases} 0 & \text{if } d(x_i) = d'(x_i), \\ mc_{m,n} & \text{if } d(x_i) = m \text{ and } d'(x_i) = n, \\ mc_{n,m} & \text{if } d(x_i) = n \text{ and } d'(x_i) = m. \end{cases} \tag{9} \]

In this way, the average misclassification cost is given by

\[ mc(U, B) = \frac{\sum_{x_i \in U} mc^*(d(x_i), d'(x_i))}{|U|}. \tag{10} \]

Step 4. Compute the average total cost (ATC). The average total cost is
given by

\[ ATC(U, B) = tc(B) + mc(U, B). \tag{11} \]

In this context, we select a feature subset in order to minimize the average
total cost. The minimal average total cost is given by

\[ ATC(U, B) = \min\{ATC(U, B') \mid B' \subseteq C\}. \tag{12} \]
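Steps 1 to 4 and the minimization in (12) can be sketched as follows. This is one possible reading, not the authors' implementation: each object is assigned the class that minimizes the misclassification cost over its own neighborhood, as in (7) and (8), and the minimum is found by plain enumeration of all subsets. The data, error bounds and costs are made up for illustration, and attributes are represented by column index.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def average_total_cost(data: Sequence[Sequence[float]],
                       labels: Sequence[str],
                       errors: Sequence[float],
                       test_costs: Sequence[float],
                       mis_cost: Dict[Tuple[str, str], float],
                       features: Sequence[int]) -> float:
    """ATC(U, B) = tc(B) + mc(U, B) for the feature subset `features`."""
    classes = sorted(set(labels))
    total_mis = 0.0
    for i, xi in enumerate(data):
        # Step 1: neighborhood of x_i on the selected features.
        nbrs = [j for j, xj in enumerate(data)
                if all(abs(xj[a] - xi[a]) <= 2 * errors[a] for a in features)]
        # Step 2: assign the class with the least cost over the neighborhood.
        assigned = min(classes, key=lambda c: sum(
            mis_cost[(labels[j], c)] for j in nbrs))
        # Step 3: cost of giving x_i the assigned class.
        total_mis += mis_cost[(labels[i], assigned)]
    # Step 4: test cost plus average misclassification cost.
    return sum(test_costs[a] for a in features) + total_mis / len(data)


# Hypothetical example: four objects, two attributes.
DATA = [[0.10, 0.50], [0.11, 0.90], [0.115, 0.10], [0.80, 0.52]]
LABELS = ["y", "y", "n", "n"]
ERRORS = [0.01, 0.02]
TEST_COSTS = [2.0, 3.0]
MIS_COST = {("y", "y"): 0.0, ("y", "n"): 80.0,
            ("n", "y"): 20.0, ("n", "n"): 0.0}

# Equation (12) by exhaustive enumeration of all feature subsets.
subsets = [s for r in range(len(ERRORS) + 1)
           for s in combinations(range(len(ERRORS)), r)]
best_cost, best_subset = min(
    (average_total_cost(DATA, LABELS, ERRORS, TEST_COSTS, MIS_COST, s), s)
    for s in subsets)
print(best_subset, best_cost)
```

Enumeration is exponential in the number of attributes, so it is only practical for a handful of features.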

The MCFS problem has a similar motivation to the cheapest cost problem
[24] or the minimal test cost reduct (MTR) problem (see, e.g., [1]). However,
the MCFS problem differs from the MTR problem in two aspects.

1. In addition to considering the test costs of each attribute, we take
   misclassification cost into account. When the misclassification costs are too
   large compared with test costs, the MCFS problem coincides with the
   MTR problem. Therefore the MCFS problem is a generalization of the
   MTR problem.

2. Attribute reduction needs to preserve a particular property of the
   decision system, whereas feature selection relies only on the cost
   information.



4. Algorithm 

In this section, the backtracking algorithm for the minimal cost feature
selection (MCFS) problem is illustrated in Algorithm 1. To invoke the algorithm,
one should initialize the global variables as follows: \(R = \emptyset\) is a
feature subset with minimal total cost; \(cmc = mc(U, R)\) is the currently
minimal cost; and use the following statement: backtrack(R, 0). The resulting
feature subset with minimal total cost is stored in R.

Generally, the search space of the feature selection algorithm is \(2^{|C|}\).
In this context, a backtracking algorithm with pruning techniques is used to
select a feature subset to minimize the total cost. There are essentially three
pruning techniques employed in Algorithm 1.

1. In Algorithm 1, Line 1 indicates that the variable i starts from l instead
   of 0. Whenever we move forward (see Line 14), the lower bound is increased.
   With this pruning technique, the solution space is \(2^{|C|}\) instead of
   \(|C|!\).

2. In Algorithm 1, Lines 3 through 5 show the second pruning method. The
   misclassification costs are non-negative in practical applications. Under
   this condition, a feature subset B is discarded if the test cost of B is
   larger than the current minimal cost (cmc). This technique can prune most
   branches.

3. Lines 6 through 8 indicate that if the new feature subset does not decrease
   the total cost, the current branch will never produce the feature subset
   with minimal total cost.
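The three pruning techniques can be sketched in Python as follows. This is a hedged reading of the pseudocode rather than the authors' implementation: it assumes that the recursive call is made for every candidate that survives the pruning tests, and it receives the test cost and the average total cost as caller-supplied functions. The names `minimal_cost_subset` and `third_pruning`, the step counter and the toy cost table are assumptions introduced here.

```python
from typing import Callable, FrozenSet, Tuple

CostFn = Callable[[FrozenSet[int]], float]


def minimal_cost_subset(n_features: int, tc: CostFn, atc: CostFn,
                        third_pruning: bool = True
                        ) -> Tuple[FrozenSet[int], float, int]:
    """Return (best subset, its total cost, number of candidates visited)."""
    best: FrozenSet[int] = frozenset()
    best_cost = atc(best)          # cmc starts at the cost of the empty set
    steps = 0

    def backtrack(current: FrozenSet[int], lower: int, current_cost: float) -> None:
        nonlocal best, best_cost, steps
        # Pruning 1: only indices >= lower, so each subset is built once.
        for i in range(lower, n_features):
            steps += 1
            candidate = current | {i}
            # Pruning 2: test cost alone already exceeds the best total cost.
            if tc(candidate) > best_cost:
                continue
            cost = atc(candidate)
            # Pruning 3: adding the feature did not lower the total cost.
            if third_pruning and cost >= current_cost:
                continue
            if cost < best_cost:
                best, best_cost = candidate, cost
            backtrack(candidate, i + 1, cost)

    backtrack(frozenset(), 0, best_cost)
    return best, best_cost, steps


# Hypothetical costs: three features and a made-up table of average
# misclassification costs per subset.
COSTS = [2.0, 3.0, 4.0]
MIS = {frozenset(): 10.0, frozenset({0}): 5.0, frozenset({1}): 5.0,
       frozenset({2}): 9.0, frozenset({0, 1}): 0.0, frozenset({0, 2}): 4.0,
       frozenset({1, 2}): 4.0, frozenset({0, 1, 2}): 0.0}


def tc(s: FrozenSet[int]) -> float:
    return sum(COSTS[i] for i in s)


def atc(s: FrozenSet[int]) -> float:
    return tc(s) + MIS[frozenset(s)]


fast = minimal_cost_subset(3, tc, atc, third_pruning=True)
slow = minimal_cost_subset(3, tc, atc, third_pruning=False)
print(sorted(fast[0]), fast[1], fast[2], slow[2])
```

The second pruning test is safe because costs are non-negative, so no superset of a too-expensive subset can beat the current best. The third test is a stronger cut that depends on the cost structure, which is why it is kept behind a flag here.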



5. Experiments 

Experiments are undertaken on four data sets from the UCI Repository of
Machine Learning Databases, as listed in Table 3. We undertake three groups
of experiments from different viewpoints.

Algorithm 1 A backtracking algorithm to the MCFS problem

Input: (U, C, d, V, I, n, tc, mc), selected tests R, current level test index
       lower bound l
Output: A set of features R with minimal total cost and AMC; they are global
        variables
Method: backtracking
 1: for (i = l; i < |C|; i++) do
 2:   B = R ∪ {a_i}
 3:   if (tc(B) > cmc) then
 4:     continue; // Pruning for too expensive test costs
 5:   end if
 6:   if (ATC(U, B) ≥ ATC(U, R)) then
 7:     continue; // Pruning for non-decreasing total cost
 8:   end if
 9:   if (ATC(U, B) < cmc) then
10:     cmc = ATC(U, B); // Update the minimal total cost
11:     R = B; // Update the set of features with minimal total cost
12:   end if
13: end for
14: backtrack(R, i + 1);



Table 3: Data sets information.

No.   Name     Domain     |U|   |C|   D = {d}
1     Liver    clinic     345   6     selector
2     Credit   commerce   690   15    class
3     Iono     physics    351   34    class
4     Diab     clinic     768   8     class



From the viewpoint of different misclassification costs settings, we undertake
two sets of experiments. First, we assume that the misclassification costs are
different from each other. Table 4 lists the optimal feature subsets based on
different misclassification costs for the Diab data set. The ratio of the two
misclassification costs is set to 10 in this experiment. As shown in this table,
when the misclassification costs are too large compared with test costs, the
test costs increase and eventually equal the average total cost. In this case,
the MCFS problem coincides with the MTR problem. Second, we assume that the
misclassification costs are identical for different misclassifications. Table 5
shows the optimal feature subsets based on a unified misclassification cost.
Since the misclassification costs are small enough, the algorithm chooses a
feature subset with minimal average total cost.
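The trade-off described above can be illustrated with a small calculation. The sketch below assumes a unified misclassification cost, so that the average total cost of a subset is its test cost plus the misclassification cost times its error rate. The two candidate subsets, their test costs and their error rates are invented for illustration and are not experimental results.

```python
from typing import Dict, Tuple

# Hypothetical candidates: name -> (test cost, fraction of misclassified objects).
CANDIDATES: Dict[str, Tuple[float, float]] = {
    "cheap": (6.0, 0.03),
    "costly": (10.0, 0.0025),
}


def average_total_cost(name: str, mis_cost: float) -> float:
    """Test cost plus average misclassification cost under a unified cost."""
    test_cost, error_rate = CANDIDATES[name]
    return test_cost + mis_cost * error_rate


def best_subset(mis_cost: float) -> str:
    """Candidate with the smallest average total cost."""
    return min(CANDIDATES, key=lambda name: average_total_cost(name, mis_cost))


for mis_cost in (100.0, 150.0, 200.0):
    print(mis_cost, best_subset(mis_cost))
```

With these made-up numbers the cheap subset wins while the misclassification cost is low, and the more expensive but more accurate subset takes over once each error becomes costly enough to justify the extra tests.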

Table 4: The optimal feature subset based on different misclassification costs.

MisCost1   MisCost2   Test costs   Total cost   Feature subset
120        1200       6.00         9.75         [1,2,3]
140        1400       6.00         10.38        [1,2,3]
160        1600       6.00         11.00        [1,2,3]
180        1800       6.00         11.63        [1,2,3]
200        2000       6.00         12.25        [1,2,3]
220        2200       10.00        12.58        [1,3,6]
240        2400       10.00        12.81        [1,3,6]
260        2600       13.00        13.00        [1,2,6]

Table 5: The optimal feature subset based on unified misclassification cost.

MisCost1   MisCost2   Test costs   Total cost   Feature subset
120        120        6.00         9.59         [1,2,3]
140        140        6.00         10.19        [1,2,3]
160        160        10.00        10.42        [0,1,2,3]
180        180        10.00        10.47        [0,1,2,3]
200        200        10.00        10.52        [0,1,2,3]
220        220        10.00        10.57        [0,1,2,3]
240        240        10.00        10.63        [0,1,2,3]
260        260        10.00        10.68        [0,1,2,3]

From the viewpoint of efficiency, we design two sets of experiments to
evaluate the performance of the pruning techniques. In these experiments,
misclassification costs are identical for different misclassifications. In this
case, we propose a Fast approach and a Slow approach. The Fast approach is the
backtracking algorithm with three pruning methods, as given by Algorithm 1. The
Slow approach omits the third pruning method. The first set studies how the
number of backtracking steps changes with the misclassification costs.
Figure 1 shows the backtracking steps of the algorithm. The second set studies
how the run time changes with the misclassification costs. Figure 2 shows the
run time. From these two figures we observe the effectiveness of the third
pruning technique.

From the viewpoint of costs, the changes of test costs and the average minimal
total cost are shown in Figure 3. In the real world, we would not select
expensive tests when misclassification costs are low. Figure 3 shows this
situation clearly.

6. Conclusion and further works 

In this paper, we take data with normal distribution measurement errors
into account and study feature selection with minimal cost. This new feature
selection problem, called minimal cost feature selection (MCFS), has a wide
application area for two reasons. From the viewpoint of the data, the
measurement errors under consideration are ubiquitous. From the viewpoint of
the minimal cost problem, the resource one can afford is often limited. In
order to obtain the optimal result, a backtracking algorithm with three
effective pruning techniques is designed for the MCFS problem. Experimental
results show that the pruning techniques are effective. This work also serves
as a benchmark for heuristic algorithms, which should be designed in our
further work for large data sets.

Figure 1: Backtracking steps: (a) Liver; (b) Credit; (c) Iono; (d) Diab
(Series: Fast and Slow; horizontal axis: misclassification cost setting.)

Figure 2: Run time: (a) Liver; (b) Credit; (c) Iono; (d) Diab